Abstract

A necessary condition for a real valued Frechet differentiable function
of a vector variable to have an extremum at a vector $x_0$ is that the Frechet
derivative vanishes at $x_0$. This paper establishes a relationship between
Frechet differentials and matrix derivatives, obtaining a necessary condition
on the matrix derivative at an extremum. These results are applied to various
scalar functions of matrix variables which occur in statistical pattern
recognition.


Applications of Matrix Derivatives To 
Optimization Problems In Statistical 
Pattern Recognition 


1. Introduction.

Let S be a transformation on a normed space X to a normed space Y.
If for $x \in X$ there is a bounded linear operator $A \in B(X,Y)$ such that

1.1)
$$\lim_{\|h\| \to 0} \frac{\|S(x+h) - S(x) - A(h)\|}{\|h\|} = 0,$$

then S is Frechet differentiable at x. The vector $A(h)$ is referred to as the Frechet
differential of S at x with increment h, and A is denoted by $\delta S(x, \cdot)$
or $S'(x)$.

We list below some important properties of the Frechet differential.
For proofs and a detailed treatment of Frechet differentials see [6, pp. 175-178].

Theorem 1.1: If S has a Frechet differential, it is unique.

Theorem 1.2: If S has a Frechet differential at x, then S is continuous
at x.

Theorem 1.3: Let f be a real valued function which is Frechet differentiable
at $x_0 \in X$. If f has an extremum at $x_0$, then $\delta f(x_0, h) = 0$ for all $h \in X$.

Example 1.4: Let $X = R^n$, where R is the field of real numbers, and let
$f(x) = f(x_1, \ldots, x_n)$ be a real valued function on X having continuous partial
derivatives with respect to each variable $x_i$. Then the Frechet
differential is

$$\delta f(x,h) = \sum_{i=1}^{n} \frac{\partial f}{\partial x_i} h_i.$$
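As a small numerical illustration of Example 1.4, the sketch below compares $\delta f(x,h)$ with the increment $f(x+h) - f(x)$ and records the remainder divided by $\|h\|$ for shrinking increments. The test function, the point x, and the direction of h are hypothetical choices made only for this illustration; they do not come from the paper.

```python
import math

def f(x):
    # Hypothetical smooth test function on R^3 (an assumption for illustration).
    return x[0] ** 2 * x[1] + math.sin(x[2])

def grad_f(x):
    # Partial derivatives of f, worked out by hand.
    return [2.0 * x[0] * x[1], x[0] ** 2, math.cos(x[2])]

def frechet_differential(x, h):
    # delta f(x, h) = sum_i (df/dx_i) * h_i, as in Example 1.4.
    return sum(g * hi for g, hi in zip(grad_f(x), h))

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

x = [1.0, 2.0, 0.5]
direction = [0.3, -0.4, 0.5]

# The remainder divided by ||h|| should shrink as ||h|| -> 0.
ratios = []
for scale in (1e-1, 1e-2, 1e-3):
    h = [scale * d for d in direction]
    x_plus_h = [xi + hi for xi, hi in zip(x, h)]
    remainder = abs(f(x_plus_h) - f(x) - frechet_differential(x, h))
    ratios.append(remainder / norm(h))
```

The list `ratios` decreasing toward zero is what the limit in 1.1) requires of the differential.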

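We denote by $M_{r \times s}$ the vector space of real $r \times s$ matrices. For
$A \in M_{r \times s}$ we denote the element in the ith row and jth column of A by
$\langle A \rangle_{ij}$. For $A \in M_{n \times n}$ let $\mathrm{tr}(A) = \sum_{i=1}^{n} \langle A \rangle_{ii}$, the usual trace of A, and let $A^T$ denote
the transpose of A.

The set $M_{r \times s}$ is a normed space with

$$\|A\| = [\mathrm{tr}(AA^T)]^{1/2} = \Big[ \sum_{i=1}^{r} \sum_{j=1}^{s} (\langle A \rangle_{ij})^2 \Big]^{1/2}$$

for $A \in M_{r \times s}$. An inner product compatible with this norm is given by

$$(A,B) = \sum_{i=1}^{r} \sum_{j=1}^{s} \langle A \rangle_{ij} \langle B \rangle_{ij} = \mathrm{tr}(A^T B).$$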


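Let $A \in M_{p \times q}$ have each entry a function of the entries of $X \in M_{m \times n}$,
and let $\partial \langle A \rangle_{ij} / \partial \langle X \rangle_{\gamma\delta}$ exist for all
$1 \le i \le p$, $1 \le j \le q$, $1 \le \gamma \le m$, $1 \le \delta \le n$. We define

$$\frac{\partial A}{\partial \langle X \rangle_{\gamma\delta}} \in M_{p \times q} \quad \text{and} \quad \frac{\partial \langle A \rangle_{ij}}{\partial X} \in M_{m \times n}$$

by

$$\Big\langle \frac{\partial A}{\partial \langle X \rangle_{\gamma\delta}} \Big\rangle_{ij} = \frac{\partial \langle A \rangle_{ij}}{\partial \langle X \rangle_{\gamma\delta}} = \Big\langle \frac{\partial \langle A \rangle_{ij}}{\partial X} \Big\rangle_{\gamma\delta}.$$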


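We make the convention that all partials are taken considering the entries of X
as being independent unless otherwise specified.

Example 1.5: $\partial X / \partial \langle X \rangle_{\gamma\delta} = J_{\gamma\delta}$, the $m \times n$ matrix with

$$\langle J_{\gamma\delta} \rangle_{ij} = \begin{cases} 1 & \text{if } \gamma = i \text{ and } \delta = j, \\ 0 & \text{otherwise.} \end{cases}$$

The above holds even in the symmetric case due to our convention. For future
reference we denote by $K_{\gamma\delta}$ the $p \times q$ matrix with

$$\langle K_{\gamma\delta} \rangle_{ij} = \begin{cases} 1 & \text{if } i = \gamma \text{ and } j = \delta, \\ 0 & \text{otherwise,} \end{cases}$$

and by $J_j$ the $n \times 1$ vector with 1 in the jth component and zero elsewhere.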


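Example 1.6: Let $Y = \mathrm{tr}(X)$ and $X \in M_{n \times n}$. Then $\partial Y / \partial X = I_n$, the identity.

One writes $\partial Y / \partial X$ instead of $\partial \langle Y \rangle_{ij} / \partial X$ or $\partial Y / \partial \langle X \rangle_{\gamma\delta}$ if and only if X or Y is
a scalar.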


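Example 1.7: Let $|X|$ denote the determinant of X. Then $\partial |X| / \partial \langle X \rangle_{\gamma\delta} = \mathrm{cof}(\langle X \rangle_{\gamma\delta})$,
the cofactor of $\langle X \rangle_{\gamma\delta}$. Hence

$$\frac{\partial |X|}{\partial X} = \mathrm{cof}(X),$$

the matrix of cofactors, and if X is full rank,

$$\frac{\partial |X|}{\partial X} = |X| X^{-T}, \quad \text{where } X^{-T} = (X^{-1})^T.$$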
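Example 1.7 can be checked numerically. The sketch below, using plain Python lists and a hypothetical $3 \times 3$ matrix chosen for illustration, compares a central-difference approximation of $\partial |X| / \partial \langle X \rangle_{\gamma\delta}$ with the corresponding cofactor.

```python
def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def minor(m, i, j):
    # 2x2 submatrix obtained by deleting row i and column j of a 3x3 matrix.
    return [[m[r][c] for c in range(3) if c != j] for r in range(3) if r != i]

def det3(m):
    # Cofactor expansion along the first row.
    return sum((-1) ** j * m[0][j] * det2(minor(m, 0, j)) for j in range(3))

def cofactor_matrix(m):
    return [[(-1) ** (i + j) * det2(minor(m, i, j)) for j in range(3)]
            for i in range(3)]

# Hypothetical full-rank test matrix (an assumption for illustration).
X = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.5, 0.0, 1.5]]

eps = 1e-6
numeric = [[0.0] * 3 for _ in range(3)]
for g in range(3):
    for d in range(3):
        plus = [row[:] for row in X]
        minus = [row[:] for row in X]
        plus[g][d] += eps
        minus[g][d] -= eps
        # Entries are perturbed one at a time, i.e. treated as independent.
        numeric[g][d] = (det3(plus) - det3(minus)) / (2.0 * eps)

cof = cofactor_matrix(X)
max_err = max(abs(numeric[i][j] - cof[i][j]) for i in range(3) for j in range(3))
```

Because the determinant is linear in each single entry, the central difference agrees with the cofactor up to rounding error.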


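Several equations are listed below which are easily verified using
component-wise arguments.

Let $y = f(X)$ be a scalar function of $X \in M_{m \times n}$ and let g be a scalar function
of y. Then

1.2)
$$\frac{\partial g(y)}{\partial X} = \frac{dg}{dy} \frac{\partial y}{\partial X}.$$

Example 1.8: If X is full rank and $t \in R$, then

$$\frac{\partial |X|^t}{\partial X} = t |X|^{t-1} \frac{\partial |X|}{\partial X} = t |X|^t X^{-T}.$$

Let $Y(X)$ and $W(X)$ be matrix valued functions of X for which the product YW is defined. Then

1.3)
$$\frac{\partial (YW)}{\partial \langle X \rangle_{\gamma\delta}} = \frac{\partial Y}{\partial \langle X \rangle_{\gamma\delta}} W + Y \frac{\partial W}{\partial \langle X \rangle_{\gamma\delta}}$$

if $\partial Y / \partial \langle X \rangle_{\gamma\delta}$ and $\partial W / \partial \langle X \rangle_{\gamma\delta}$ exist.

If $Y(X)$ is full rank and $\partial Y / \partial \langle X \rangle_{\gamma\delta}$ exists, then by equation 1.3) we have

1.4)
$$\frac{\partial Y^{-1}}{\partial \langle X \rangle_{\gamma\delta}} = -Y^{-1} \frac{\partial Y}{\partial \langle X \rangle_{\gamma\delta}} Y^{-1}.$$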


Example 1.9: If X is full rank, then

$$\frac{\partial X^{-1}}{\partial \langle X \rangle_{\gamma\delta}} = -X^{-1} J_{\gamma\delta} X^{-1}.$$
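The identity in Example 1.9 can be verified entry by entry, since $\langle X^{-1} J_{\gamma\delta} X^{-1} \rangle_{ij} = \langle X^{-1} \rangle_{i\gamma} \langle X^{-1} \rangle_{\delta j}$. The sketch below does this for a hypothetical $2 \times 2$ matrix using a closed-form inverse; the matrix and step size are assumptions made for illustration.

```python
def inv2(m):
    # Closed-form inverse of a 2x2 matrix.
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Hypothetical full-rank test matrix (an assumption for illustration).
X = [[2.0, 1.0],
     [0.5, 3.0]]
X_inv = inv2(X)

eps = 1e-6
max_err = 0.0
for g in range(2):
    for d in range(2):
        plus = [row[:] for row in X]
        minus = [row[:] for row in X]
        plus[g][d] += eps
        minus[g][d] -= eps
        inv_plus, inv_minus = inv2(plus), inv2(minus)
        for i in range(2):
            for j in range(2):
                numeric = (inv_plus[i][j] - inv_minus[i][j]) / (2.0 * eps)
                # <-X^{-1} J_{gd} X^{-1}>_{ij} = -<X^{-1}>_{ig} <X^{-1}>_{dj}
                formula = -X_inv[i][g] * X_inv[d][j]
                max_err = max(max_err, abs(numeric - formula))
```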


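Let $Y = Y(X) \in M_{p \times q}$ and $w = w(Y) \in M_{r \times s}$ with $X \in M_{m \times n}$. If the
indicated derivatives exist, then

1.5)
$$\frac{\partial w}{\partial \langle X \rangle_{\gamma\delta}} = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{\partial w}{\partial \langle Y \rangle_{ij}} \frac{\partial \langle Y \rangle_{ij}}{\partial \langle X \rangle_{\gamma\delta}}$$

and, if w is scalar valued,

$$\frac{\partial w}{\partial X} = \sum_{i,j} \frac{\partial w}{\partial \langle Y \rangle_{ij}} \frac{\partial \langle Y \rangle_{ij}}{\partial X}.$$

In his excellent paper on matrix derivatives Dwyer proves an extremely useful
theorem establishing a procedure for calculating $\partial \langle Y \rangle_{ij} / \partial X$ when
$\partial Y / \partial \langle X \rangle_{\gamma\delta}$ is of a
certain form. We state the theorem without proof. For a proof of this
theorem see Dwyer [4, p. 612].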


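Theorem 1.10: Let $X \in M_{m \times n}$ and $Y \in M_{p \times q}$. Then

$$\frac{\partial Y}{\partial \langle X \rangle_{\gamma\delta}} = \sum_{q} A_q J_{\gamma\delta} B_q + \sum_{h} C_h J_{\gamma\delta}^T D_h$$

if and only if

$$\frac{\partial \langle Y \rangle_{ij}}{\partial X} = \sum_{q} A_q^T K_{ij} B_q^T + \sum_{h} D_h K_{ij}^T C_h.$$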

All matrix multiplications must be defined when applying this theorem. This
condition must also be observed in applying the following corollary, which is a
restatement of the theorem for the scalar valued case.


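Corollary 1.11: Let f be scalar valued. Then

$$\frac{\partial f}{\partial \langle X \rangle_{\gamma\delta}} = \sum_{q} A_q J_{\gamma\delta} B_q + \sum_{h} C_h J_{\gamma\delta}^T D_h$$

if and only if

$$\frac{\partial f}{\partial X} = \sum_{q} A_q^T B_q^T + \sum_{h} D_h C_h.$$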
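To see Corollary 1.11 at work on a case with one term of each kind, take the hypothetical function $f(X) = a^T X b + c^T X^T d$ with $X \in M_{2 \times 3}$. Then $\partial f / \partial \langle X \rangle_{\gamma\delta} = a^T J_{\gamma\delta} b + c^T J_{\gamma\delta}^T d$, so the corollary gives $\partial f / \partial X = a b^T + d c^T$. The vectors and the evaluation point below are assumptions chosen only for this check.

```python
# Hypothetical coefficient vectors (assumptions for illustration).
a = [1.0, -2.0]          # length m = 2
b = [0.5, 1.5, -1.0]     # length n = 3
c = [2.0, 0.0, 1.0]      # length n = 3
d = [-1.0, 3.0]          # length m = 2

def f(X):
    # f(X) = a^T X b + c^T X^T d for a 2x3 matrix X.
    first = sum(a[i] * X[i][j] * b[j] for i in range(2) for j in range(3))
    second = sum(c[j] * X[i][j] * d[i] for i in range(2) for j in range(3))
    return first + second

# Matrix derivative predicted by Corollary 1.11: a b^T + d c^T.
predicted = [[a[i] * b[j] + d[i] * c[j] for j in range(3)] for i in range(2)]

X = [[0.2, -1.0, 0.7],
     [1.3, 0.4, -0.6]]

eps = 1e-4
max_err = 0.0
for g in range(2):
    for dl in range(3):
        plus = [row[:] for row in X]
        minus = [row[:] for row in X]
        plus[g][dl] += eps
        minus[g][dl] -= eps
        numeric = (f(plus) - f(minus)) / (2.0 * eps)
        max_err = max(max_err, abs(numeric - predicted[g][dl]))
```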
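2. The Frechet Differential of Matrix Valued Functions of a Matrix Variable.

The following theorem establishes a relationship between the Frechet
differential and the matrix derivatives.

Theorem 2.1: Let f be an operator from $M_{m \times n}$ to $M_{p \times q}$ with
$\partial \langle f(X) \rangle_{ij} / \partial \langle X \rangle_{\gamma\delta}$
continuous for $1 \le i \le p$, $1 \le j \le q$, $1 \le \gamma \le m$, $1 \le \delta \le n$, with all entries
of X independent. Then f is Frechet differentiable and

$$\delta f(X,H) = \sum_{\gamma,\delta} \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} \langle H \rangle_{\gamma\delta}.$$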

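Proof: $\delta f(X,H)$ is obviously linear in H. Applying the Cauchy inequality,

$$\|\delta f(X,H)\| \le \sum_{\gamma,\delta} \Big\| \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} \Big\| \, |\langle H \rangle_{\gamma\delta}| \le \Big( \sum_{\gamma,\delta} \Big\| \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} \Big\|^2 \Big)^{1/2} \|H\|.$$

Thus $\delta f(X,\cdot)$ is bounded with operator norm

$$\|\delta f(X,\cdot)\| \le \Big( \sum_{\gamma,\delta} \Big\| \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} \Big\|^2 \Big)^{1/2}.$$

Let $H \in M_{m \times n}$. For $1 \le i \le m$, $1 \le j \le n$ let

$$G_{i,j} = \sum_{k=1}^{i-1} \sum_{l=1}^{n} \langle H \rangle_{kl} J_{kl} + \sum_{l=1}^{j} \langle H \rangle_{il} J_{il},$$

and let $G_{1,0} = 0$ and $G_{i,0} = G_{i-1,n}$ for $1 < i \le m$. Then

$$f(X+H) - f(X) = \sum_{i,j} [f(X + G_{i,j}) - f(X + G_{i,j-1})]$$

and $(X + G_{i,j}) - (X + G_{i,j-1}) = \langle H \rangle_{ij} J_{ij}$.

Using the mean-value theorem and the continuity of $\partial \langle f(X) \rangle_{kl} / \partial \langle X \rangle_{ij}$, we have
for each $\varepsilon > 0$ a $d > 0$ such that $\|H\| < d$ implies

$$\Big| \langle f(X + G_{i,j}) - f(X + G_{i,j-1}) \rangle_{kl} - \frac{\partial \langle f(X) \rangle_{kl}}{\partial \langle X \rangle_{ij}} \langle H \rangle_{ij} \Big| \le \frac{\varepsilon}{pqmn} |\langle H \rangle_{ij}|$$

for all $1 \le i \le m$, $1 \le j \le n$, $1 \le k \le p$, $1 \le l \le q$.

Let $\varepsilon > 0$ and $\|H\| < d$ for the $d > 0$ described above. Then

$$\frac{\|f(X+H) - f(X) - \delta f(X,H)\|}{\|H\|} \le \frac{1}{\|H\|} \sum_{i,j} \Big\| f(X + G_{i,j}) - f(X + G_{i,j-1}) - \frac{\partial f(X)}{\partial \langle X \rangle_{ij}} \langle H \rangle_{ij} \Big\| \le \sum_{i,j} \sum_{k,l} \frac{|\langle H \rangle_{ij}|}{\|H\|} \frac{\varepsilon}{pqmn} \le \varepsilon.$$

Thus

$$\lim_{\|H\| \to 0} \frac{\|f(X+H) - f(X) - \delta f(X,H)\|}{\|H\|} = 0$$

and the theorem is proved.

If the entries of X are not independent then a parallel argument will
show that

$$\delta f(X,H) = \sum \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} \langle H \rangle_{\gamma\delta},$$

where the summation is over all $\gamma\delta$ for which the $\langle X \rangle_{\gamma\delta}$ are
independent, and $\partial f(X) / \partial \langle X \rangle_{\gamma\delta}$ is taken with the entries of X
not being considered as independent.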

If f is a scalar valued function of a matrix X and is Frechet differentiable
at X, where all entries of X are independent, the conclusion of the theorem
can be stated as $\delta f(X,H) = (\partial f(X) / \partial X, H)$, the inner product of H with the
matrix derivative of f at X. It is shown below that this relationship holds
in certain cases in which the entries of X are not independent.

Theorem 2.2: Let f be a scalar valued Frechet differentiable function of a
symmetric matrix. Then for symmetric H,

$$\delta f(X,H) = \Big( \frac{\partial f(X)}{\partial X}, H \Big).$$


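Proof: From the statement following Theorem 2.1 we have

$$\delta f(X,H) = \sum_{i \le j} \frac{\hat{\partial} f(X)}{\hat{\partial} \langle X \rangle_{ij}} \langle H \rangle_{ij},$$

where $\hat{\partial} f(X) / \hat{\partial} \langle X \rangle_{ij}$ (written here with a hat) is taken without the usual
convention of treating the entries of X as being independent. From elementary calculus,

$$\frac{\hat{\partial} f(X)}{\hat{\partial} \langle X \rangle_{\gamma\delta}} = \sum_{i,j} \frac{\partial f(X)}{\partial \langle X \rangle_{ij}} \frac{\hat{\partial} \langle X \rangle_{ij}}{\hat{\partial} \langle X \rangle_{\gamma\delta}},$$

where $\partial f(X) / \partial \langle X \rangle_{ij}$ is
taken by treating all entries as being independent. When X is
symmetric and $\delta > \gamma$,

$$\frac{\hat{\partial} f(X)}{\hat{\partial} \langle X \rangle_{\gamma\delta}} = \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} + \frac{\partial f(X)}{\partial \langle X \rangle_{\delta\gamma}}$$

and

$$\frac{\hat{\partial} f(X)}{\hat{\partial} \langle X \rangle_{\gamma\gamma}} = \frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\gamma}},$$

since it is assumed that symmetry is the only dependency condition. Thus, using $\langle H \rangle_{ij} = \langle H \rangle_{ji}$,

$$\delta f(X,H) = \sum_{i<j} \frac{\partial f(X)}{\partial \langle X \rangle_{ij}} \langle H \rangle_{ij} + \sum_{i<j} \frac{\partial f(X)}{\partial \langle X \rangle_{ji}} \langle H \rangle_{ji} + \sum_{i} \frac{\partial f(X)}{\partial \langle X \rangle_{ii}} \langle H \rangle_{ii} = \sum_{i,j} \frac{\partial f(X)}{\partial \langle X \rangle_{ij}} \langle H \rangle_{ij} = \Big( \frac{\partial f(X)}{\partial X}, H \Big)$$

and the theorem is proved.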


If f satisfies the hypothesis of Theorem 2.2 and X and H are diagonal,
then a similar argument proves that $\delta f(X,H) = (\partial f(X) / \partial X, H)$.

The matrices $A, B \in M_{m \times n}$ are orthogonal if $(A,B) = 0$. The only matrix in
$M_{m \times n}$ which is orthogonal to every matrix in $M_{m \times n}$ is 0, the zero matrix.
The only symmetric matrix orthogonal to every symmetric matrix is 0, and the
only diagonal matrix orthogonal to every diagonal matrix is 0. An immediate
consequence of the above facts and Theorem 2.2 is


Corollary 2.3: Let f be a scalar valued function of a matrix X which has
all entries independent, is symmetric, or is diagonal, and satisfies the hypothesis
of Theorem 2.2. If f has an extreme value at $X_0$, then the matrix derivative
of f vanishes at $X_0$.
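The bookkeeping in the proof of Theorem 2.2 can be traced numerically. In the sketch below the hypothetical function $f(X) = a^T X b$ has matrix derivative $a b^T$ when all entries are treated as independent. The code forms the partials taken along the symmetry constraint (perturbing $\langle X \rangle_{ij}$ and $\langle X \rangle_{ji}$ together), sums them over $i \le j$ against $\langle H \rangle_{ij}$, and compares the result with the inner product $(\partial f / \partial X, H)$. All numerical values are assumptions chosen for illustration.

```python
n = 3
# Hypothetical coefficient vectors (assumptions for illustration).
a = [1.0, 2.0, -1.0]
b = [0.5, -1.0, 3.0]

def f(X):
    # f(X) = a^T X b
    return sum(a[i] * X[i][j] * b[j] for i in range(n) for j in range(n))

# Matrix derivative with all entries treated as independent: a b^T.
grad = [[a[i] * b[j] for j in range(n)] for i in range(n)]

# Symmetric point X and symmetric increment H.
X = [[2.0, 0.5, -1.0], [0.5, 1.0, 0.3], [-1.0, 0.3, 4.0]]
H = [[0.1, -0.2, 0.4], [-0.2, 0.3, 0.0], [0.4, 0.0, -0.5]]

# Inner product (df/dX, H).
inner = sum(grad[i][j] * H[i][j] for i in range(n) for j in range(n))

# Sum over i <= j of partials taken WITHOUT the independence convention.
eps = 1e-4
constrained = 0.0
for i in range(n):
    for j in range(i, n):
        plus = [row[:] for row in X]
        minus = [row[:] for row in X]
        plus[i][j] += eps
        minus[i][j] -= eps
        if i != j:
            # Symmetry ties <X>_{ji} to <X>_{ij}.
            plus[j][i] += eps
            minus[j][i] -= eps
        partial = (f(plus) - f(minus)) / (2.0 * eps)
        constrained += partial * H[i][j]

# f is linear, so the increment itself equals the differential.
X_plus_H = [[X[i][j] + H[i][j] for j in range(n)] for i in range(n)]
increment = f(X_plus_H) - f(X)
```

Note that $a b^T$ is not itself symmetric here; the equality holds because H is symmetric.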

3. Applications to MLSE, MLEST, and PMC.

In the remainder of this paper the following notation will be used. Let
$\{x_k\}_{k=1}^{N}$ be a collection of N samples each having n features
($x_k \in R^n$), with the samples taken from a mixture of m classes. Let
$\{\alpha_i, \mu_i, \Sigma_i\}_{i=1}^{m}$ be signature parameters where $\alpha_i \in R$, $\mu_i \in R^n$ and $\Sigma_i \in M_{n \times n}$
are, respectively, the a priori probability, the mean vector, and the covariance
matrix for observations from the ith class. The distribution $p(x_k)$ is a mixture of
normal densities with

$$p(x_k) = \sum_{i=1}^{m} \alpha_i p_i(x_k),$$

where the conditional density function $p_i(x_k)$ is given by

$$p_i(x_k) = (2\pi)^{-n/2} |\Sigma_i|^{-1/2} e^{-\frac{1}{2} r_i},$$

where

$$r_i = (x_k - \mu_i)^T \Sigma_i^{-1} (x_k - \mu_i).$$

If some of the signature parameters are known, a maximum likelihood signature
estimate (MLSE) is a choice of the remaining parameters which maximizes the
log-likelihood function

$$L = \sum_{k=1}^{N} \log(p(x_k)).$$

We now obtain the likelihood equations by taking the matrix derivative of
L with respect to the means and covariance in each class and applying
Corollary 2.3.




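From Example 1.8,

$$\frac{\partial |\Sigma_i|^{-1/2}}{\partial \Sigma_i} = -\frac{1}{2} |\Sigma_i|^{-1/2} \Sigma_i^{-T}.$$

Since

$$\frac{\partial r_i}{\partial \langle \Sigma_i \rangle_{\gamma\delta}} = -(x_k - \mu_i)^T \Sigma_i^{-1} J_{\gamma\delta} \Sigma_i^{-1} (x_k - \mu_i),$$

then by Corollary 1.11,

$$\frac{\partial r_i}{\partial \Sigma_i} = -\Sigma_i^{-T} (x_k - \mu_i)(x_k - \mu_i)^T \Sigma_i^{-T}.$$

Thus

$$\frac{\partial L}{\partial \Sigma_i} = \sum_{k=1}^{N} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Big[ -\frac{1}{2} \Sigma_i^{-T} + \frac{1}{2} \Sigma_i^{-T} (x_k - \mu_i)(x_k - \mu_i)^T \Sigma_i^{-T} \Big].$$

Applying Corollary 2.3 and the symmetry of $\Sigma_i$ we have

$$\sum_{k=1}^{N} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Sigma_i^{-1} = \sum_{k=1}^{N} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Sigma_i^{-1} (x_k - \mu_i)(x_k - \mu_i)^T \Sigma_i^{-1}$$

and hence

3.1)
$$\Sigma_i = \Big[ \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Big]^{-1} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} (x_k - \mu_i)(x_k - \mu_i)^T$$

at extrema.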


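Next,

$$\frac{\partial L}{\partial \mu_i} = \sum_{k=1}^{N} \frac{\alpha_i}{p(x_k)} \frac{\partial p_i(x_k)}{\partial \mu_i},$$

where

$$\frac{\partial p_i(x_k)}{\partial \mu_i} = -\frac{1}{2} p_i(x_k) \frac{\partial r_i}{\partial \mu_i} \quad \text{and} \quad \frac{\partial r_i}{\partial \mu_i} = -2 \Sigma_i^{-1} (x_k - \mu_i).$$

Hence

$$\frac{\partial L}{\partial \mu_i} = \sum_{k=1}^{N} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Sigma_i^{-1} (x_k - \mu_i).$$

At extrema we obtain

3.2)
$$\mu_i = \Big[ \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} \Big]^{-1} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)} x_k.$$

By use of a Lagrange multiplier to enforce the constraints $\sum_{i=1}^{m} \alpha_i = 1$
and $\alpha_i \ge 0$, the following expression is obtained at extrema:

3.3)
$$\alpha_i = \frac{\alpha_i}{N} \sum_{k=1}^{N} \frac{p_i(x_k)}{p(x_k)}.$$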


Equations 3.1), 3.2), 3.3) are the likelihood equations which serve as 
a point of departure for the results on MLSE by Walker and Peters in [8]. 
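One way to read equations 3.1) to 3.3) is as a fixed-point map on the signature parameters. The sketch below applies successive substitution to a one-feature, two-class mixture. The choice of successive substitution, the synthetic data, and the starting values are assumptions made for illustration; this is not claimed to be the procedure of [8].

```python
import math

def normal_pdf(x, mu, var):
    # One-dimensional normal density p_i(x) (n = 1).
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture(x, alpha, mu, var):
    # p(x) = sum_i alpha_i p_i(x)
    return sum(a * normal_pdf(x, m, v) for a, m, v in zip(alpha, mu, var))

def log_likelihood(xs, alpha, mu, var):
    return sum(math.log(mixture(x, alpha, mu, var)) for x in xs)

def substitution_step(xs, alpha, mu, var):
    # Evaluate the right-hand sides of 3.1), 3.2), 3.3) at the current values.
    mix = [mixture(x, alpha, mu, var) for x in xs]
    new_alpha, new_mu, new_var = [], [], []
    for a, m, v in zip(alpha, mu, var):
        # w[k] = alpha_i p_i(x_k) / p(x_k)
        w = [a * normal_pdf(x, m, v) / p for x, p in zip(xs, mix)]
        total = sum(w)
        m_new = sum(wk * x for wk, x in zip(w, xs)) / total               # 3.2)
        v_new = sum(wk * (x - m_new) ** 2 for wk, x in zip(w, xs)) / total  # 3.1)
        new_alpha.append(total / len(xs))                                  # 3.3)
        new_mu.append(m_new)
        new_var.append(v_new)
    return new_alpha, new_mu, new_var

# Synthetic samples clustered near 0 and near 5 (hypothetical data).
xs = [-0.5, 0.1, 0.4, -0.2, 0.3, 4.6, 5.2, 5.1, 4.8, 5.4]
alpha, mu, var = [0.5, 0.5], [1.0, 4.0], [1.0, 1.0]

loglik_before = log_likelihood(xs, alpha, mu, var)
for _ in range(25):
    alpha, mu, var = substitution_step(xs, alpha, mu, var)
loglik_after = log_likelihood(xs, alpha, mu, var)
```

In this sketch the variance update uses the freshly updated mean, and the common factor $\alpha_i$ cancels in the ratios for 3.1) and 3.2), so the weights $\alpha_i p_i / p$ may be used throughout.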

Maximum likelihood estimation of signature transformation (MLEST) is
a procedure that adjusts signatures from a training segment to compensate for haze and
sun angle, assuming that the adjustments are given by an affine transformation
$x_k = A y_k + b$, where A is an $n \times n$ non-singular matrix and b is an n-vector,
which transforms $y_k$, the kth pixel from the training segment, to $x_k$, the
kth pixel in the recognition segment. Using this transformation, a set of
parameters for the recognition segment is obtained as follows:

$$\mu_i' = A \mu_i + b,$$

$$\Sigma_i' = A \Sigma_i A^T.$$


The conditional density function of the transformed ith class is

$$p_i'(x_k) = (2\pi)^{-n/2} |A \Sigma_i A^T|^{-1/2} e^{-\frac{1}{2} r_i},$$

where

$$r_i = (x_k - A\mu_i - b)^T (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b).$$

The transformed likelihood function is

$$L' = \sum_{k=1}^{N} \log(p'(x_k)), \quad \text{where} \quad p'(x_k) = \sum_{i=1}^{m} \alpha_i p_i'(x_k).$$

Throughout the remainder of this discussion the primes will be suppressed.


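$$\frac{\partial L}{\partial A} = \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i}{p(x_k)} \frac{\partial p_i(x_k)}{\partial A},$$

where

$$\frac{\partial p_i(x_k)}{\partial A} = (2\pi)^{-n/2} \Big[ \frac{\partial |A \Sigma_i A^T|^{-1/2}}{\partial A} e^{-\frac{1}{2} r_i} + |A \Sigma_i A^T|^{-1/2} \frac{\partial e^{-\frac{1}{2} r_i}}{\partial A} \Big].$$

Now

$$\begin{aligned}
\frac{\partial r_i}{\partial \langle A \rangle_{\gamma\delta}} ={}& -\mu_i^T J_{\gamma\delta}^T (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b) + (x_k - A\mu_i - b)^T (A \Sigma_i A^T)^{-1} (-J_{\gamma\delta} \mu_i) \\
& + (x_k - A\mu_i - b)^T (-(A \Sigma_i A^T)^{-1}) (J_{\gamma\delta} \Sigma_i A^T + A \Sigma_i J_{\gamma\delta}^T) (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b).
\end{aligned}$$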


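Thus, by Corollary 1.11 and since $(A \Sigma_i A^T)^{-1} A \Sigma_i = A^{-T}$,

$$\begin{aligned}
\frac{\partial r_i}{\partial A} &= -2 (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b) [\mu_i^T + (x_k - A\mu_i - b)^T A^{-T}] \\
&= -2 (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b) (x_k - b)^T A^{-T}.
\end{aligned}$$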
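The closed form for $\partial r_i / \partial A$ can be checked against central differences. The sketch below does so for $n = 2$ with a closed-form $2 \times 2$ inverse; the values of $\Sigma_i$, $\mu_i$, b, $x_k$ and A are hypothetical and chosen only so that A is non-singular.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Hypothetical values (assumptions for illustration).
Sigma = [[2.0, 0.3], [0.3, 1.0]]
mu = [1.0, -0.5]
b = [0.2, 0.1]
x = [1.5, 0.7]
A = [[1.2, 0.4], [-0.3, 0.9]]

def residual(A):
    # v = x - A mu - b
    return [x[i] - sum(A[i][j] * mu[j] for j in range(2)) - b[i] for i in range(2)]

def r(A):
    # r = v^T (A Sigma A^T)^{-1} v
    S_inv = inv2(matmul(matmul(A, Sigma), transpose(A)))
    v = residual(A)
    return sum(v[i] * S_inv[i][j] * v[j] for i in range(2) for j in range(2))

# Closed form: -2 (A Sigma A^T)^{-1} v (x - b)^T A^{-T}
S_inv = inv2(matmul(matmul(A, Sigma), transpose(A)))
v = residual(A)
S_inv_v = [sum(S_inv[i][j] * v[j] for j in range(2)) for i in range(2)]
A_inv = inv2(A)
# Row vector (x - b)^T A^{-T}, i.e. the transpose of A^{-1}(x - b).
w = [sum(A_inv[j][k] * (x[k] - b[k]) for k in range(2)) for j in range(2)]
formula = [[-2.0 * S_inv_v[i] * w[j] for j in range(2)] for i in range(2)]

eps = 1e-6
max_err = 0.0
for g in range(2):
    for d in range(2):
        plus = [row[:] for row in A]
        minus = [row[:] for row in A]
        plus[g][d] += eps
        minus[g][d] -= eps
        numeric = (r(plus) - r(minus)) / (2.0 * eps)
        max_err = max(max_err, abs(numeric - formula[g][d]))
```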




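Also,

$$\frac{\partial |A \Sigma_i A^T|^{-1/2}}{\partial A} = -|A \Sigma_i A^T|^{-1/2} (A \Sigma_i A^T)^{-1} A \Sigma_i = -|A \Sigma_i A^T|^{-1/2} A^{-T},$$

since A is invertible. Thus

$$\frac{\partial L}{\partial A} = \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Big[ -A^{-T} + (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b)(x_k - b)^T A^{-T} \Big].$$

At extrema we have

$$\sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} A^{-T} = \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b)(x_k - b)^T A^{-T}.$$

Hence, since $\sum_{i=1}^{m} \alpha_i p_i(x_k) = p(x_k)$,

3.4)
$$I = \frac{1}{N} \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b)(x_k - b)^T.$$

Similarly,

$$\frac{\partial L}{\partial b} = \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Big( -\frac{1}{2} \Big) \frac{\partial r_i}{\partial b},$$

where

$$\frac{\partial r_i}{\partial \langle b \rangle_j} = (x_k - A\mu_i - b)^T (A \Sigma_i A^T)^{-1} (-J_j) + (-J_j)^T (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b)$$

and hence

$$\frac{\partial r_i}{\partial b} = -2 (A \Sigma_i A^T)^{-1} (x_k - A\mu_i - b).$$

Thus

$$\sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} (A \Sigma_i A^T)^{-1} b = \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} (A \Sigma_i A^T)^{-1} (x_k - A\mu_i)$$

at extrema and

3.5)
$$b = A \Big\{ \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Sigma_i^{-1} \Big\}^{-1} \Big[ \sum_{k=1}^{N} \sum_{i=1}^{m} \frac{\alpha_i p_i(x_k)}{p(x_k)} \Sigma_i^{-1} A^{-1} (x_k - A\mu_i) \Big].$$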

Equations 3.3), 3.4), and 3.5) are the transformed likelihood equations
which serve as a starting point for recent results on MLEST obtained by McCabe
and Solomon in [7].

In [5] Walker and Guseman show that the probability of misclassification
(PMC) of a transformed observation is a differentiable function of the $k \times n$
feature selection matrix B which transforms normally distributed observations
in $R^n$ to normally distributed observations in $R^k$. Calculating the differentials
of the PMC is a direct result of calculating the differentials of the transformed
density function. By making use of previous calculations we give an abbreviated
version of these calculations.

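The transformed density function is

$$p(x,B) = (2\pi)^{-k/2} |B \Sigma B^T|^{-1/2} e^{-\frac{1}{2} r},$$

where

$$r = (x - B\mu)^T (B \Sigma B^T)^{-1} (x - B\mu).$$

From the calculations in the MLEST problem,

$$\frac{\partial |B \Sigma B^T|^{-1/2}}{\partial B} = -|B \Sigma B^T|^{-1/2} (B \Sigma B^T)^{-1} B \Sigma$$

and

$$\frac{\partial r}{\partial B} = -2 (B \Sigma B^T)^{-1} (x - B\mu) \mu^T - 2 (B \Sigma B^T)^{-1} (x - B\mu)(x - B\mu)^T (B \Sigma B^T)^{-1} B \Sigma.$$

Hence

$$\frac{\partial p(x,B)}{\partial B} = -p(x,B) \{ (B \Sigma B^T)^{-1} (B \Sigma) - (B \Sigma B^T)^{-1} (x - B\mu) [\mu^T + (x - B\mu)^T (B \Sigma B^T)^{-1} B \Sigma] \}.$$

The Frechet differential of p at (x,B) with increment C is $(\partial p(x,B) / \partial B, C)$,
and using the above expression for $\partial p(x,B) / \partial B$ it is easy to show that

$$\Big( \frac{\partial p(x,B)}{\partial B}, C \Big) = -p(x,B) \{ \mathrm{tr}[C \Sigma B^T (B \Sigma B^T)^{-1}] - (x - B\mu)^T (B \Sigma B^T)^{-1} [C\mu + \tfrac{1}{2} (C \Sigma B^T + B \Sigma C^T)(B \Sigma B^T)^{-1} (x - B\mu)] \},$$

which is the form of the differential obtained in [5].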


4. Matrix Derivatives of Traces and Applications.

We begin this section with the main theorem.

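Theorem 4.1: Let $u = u(Y)$ be a scalar valued function of $Y = Y(X)$, where
$Y \in M_{p \times r}$ and $X \in M_{m \times n}$. If $\partial u / \partial Y$ and $\partial Y / \partial \langle X \rangle_{\gamma\delta}$ exist and

$$\frac{\partial Y}{\partial \langle X \rangle_{\gamma\delta}} = \sum_{q} A_q J_{\gamma\delta} B_q + \sum_{h} C_h J_{\gamma\delta}^T D_h,$$

then

$$\frac{\partial u}{\partial X} = \sum_{q} A_q^T \Big( \frac{\partial u}{\partial Y} \Big) B_q^T + \sum_{h} D_h \Big( \frac{\partial u}{\partial Y} \Big)^T C_h.$$

(Remark: As in previous theorems on matrix derivatives, it is essential to
determine if the aforementioned side condition concerning matrix multiplication
is satisfied.)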


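Proof: From equation 1.5),

$$\frac{\partial u}{\partial X} = \sum_{i,j} \frac{\partial u}{\partial \langle Y \rangle_{ij}} \frac{\partial \langle Y \rangle_{ij}}{\partial X} = \sum_{i,j} \Big\langle \frac{\partial u}{\partial Y} \Big\rangle_{ij} \frac{\partial \langle Y \rangle_{ij}}{\partial X}.$$

By applying Theorem 1.10 to the hypothesis, we have

$$\frac{\partial \langle Y \rangle_{ij}}{\partial X} = \sum_{q} A_q^T K_{ij} B_q^T + \sum_{h} D_h K_{ij}^T C_h.$$

Thus

$$\begin{aligned}
\frac{\partial u}{\partial X} &= \sum_{i,j} \Big\langle \frac{\partial u}{\partial Y} \Big\rangle_{ij} \Big[ \sum_{q} A_q^T K_{ij} B_q^T + \sum_{h} D_h K_{ij}^T C_h \Big] \\
&= \sum_{q} A_q^T \Big[ \sum_{i,j} \Big\langle \frac{\partial u}{\partial Y} \Big\rangle_{ij} K_{ij} \Big] B_q^T + \sum_{h} D_h \Big[ \sum_{i,j} \Big\langle \frac{\partial u}{\partial Y} \Big\rangle_{ij} K_{ij} \Big]^T C_h \\
&= \sum_{q} A_q^T \Big( \frac{\partial u}{\partial Y} \Big) B_q^T + \sum_{h} D_h \Big( \frac{\partial u}{\partial Y} \Big)^T C_h
\end{aligned}$$

and the theorem is proved.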

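The following result (due to Dwyer in [2]) is the most useful form of
the above theorem, especially in applications to multivariate statistical analysis.

Corollary 4.2: Let f(X) be a matrix valued function of a matrix X. If

$$\frac{\partial f(X)}{\partial \langle X \rangle_{\gamma\delta}} = \sum_{q} A_q J_{\gamma\delta} B_q + \sum_{h} C_h J_{\gamma\delta}^T D_h,$$

then

$$\frac{\partial \, \mathrm{tr}(f(X))}{\partial X} = \sum_{q} A_q^T B_q^T + \sum_{h} D_h C_h.$$

Proof: From Example 1.6 we have $\partial \, \mathrm{tr}(Y) / \partial Y = I$. Now apply the theorem.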
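As a concrete instance of Corollary 4.2, take $f(B) = BAB^T$, so that $\partial f / \partial \langle B \rangle_{\gamma\delta} = J_{\gamma\delta} A B^T + B A J_{\gamma\delta}^T$ and the corollary gives $\partial \, \mathrm{tr}(BAB^T) / \partial B = BA^T + BA$, which reduces to $2BA$ when A is symmetric. The sketch below checks the general form by central differences; the matrices are hypothetical, and A is deliberately not symmetric so that both sums in the corollary are exercised.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical matrices (assumptions for illustration): A is 3x3, B is 2x3.
A = [[2.0, 1.0, 0.0],
     [0.5, 3.0, -1.0],
     [1.0, 0.0, 1.5]]
B = [[1.0, -0.5, 2.0],
     [0.3, 1.2, -0.7]]

def trace_form(B):
    # tr(B A B^T) for a 2x3 matrix B.
    M = matmul(matmul(B, A), transpose(B))
    return M[0][0] + M[1][1]

# Corollary 4.2: first sum gives B A^T, second sum gives B A.
BA = matmul(B, A)
BAt = matmul(B, transpose(A))
predicted = [[BAt[i][j] + BA[i][j] for j in range(3)] for i in range(2)]

eps = 1e-4
max_err = 0.0
for g in range(2):
    for d in range(3):
        plus = [row[:] for row in B]
        minus = [row[:] for row in B]
        plus[g][d] += eps
        minus[g][d] -= eps
        numeric = (trace_form(plus) - trace_form(minus)) / (2.0 * eps)
        max_err = max(max_err, abs(numeric - predicted[g][d]))
```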




The corollary above provides an effective tool for solving optimization
problems dealing with the trace function. To illustrate the strength of the
corollary, we obtain some shortcuts to finding necessary conditions for extrema
for some important trace functions.

In [10] Quirein, using extensive and tedious calculations, obtained the
derivatives of

$$\phi = \mathrm{tr}(BAB^T) - \mathrm{tr}\{ M^T [(BAB^T)^{-1} - (BDB^T)^{-1} - I_k] \}$$

with respect to B and D, where B is a $k \times n$ matrix of rank k, A is an
$n \times n$ positive definite, symmetric matrix, D is a positive definite diagonal
matrix, and $I_k$ is the $k \times k$ identity matrix. We present a less painful method
of calculating these derivatives.

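Since

$$\frac{\partial (BAB^T)}{\partial \langle B \rangle_{\gamma\delta}} = J_{\gamma\delta} A B^T + B A J_{\gamma\delta}^T$$

and

$$\frac{\partial M^T \{ (BAB^T)^{-1} - (BDB^T)^{-1} - I_k \}}{\partial \langle B \rangle_{\gamma\delta}} = M^T \{ -(BAB^T)^{-1} [J_{\gamma\delta} A B^T + B A J_{\gamma\delta}^T] (BAB^T)^{-1} + (BDB^T)^{-1} [J_{\gamma\delta} D B^T + B D J_{\gamma\delta}^T] (BDB^T)^{-1} \},$$

then by Corollary 4.2, taking M to be symmetric (as the factors of 2 require),

$$\begin{aligned}
\frac{\partial \phi}{\partial B} &= 2BA + 2 (BAB^T)^{-1} M (BAB^T)^{-1} BA - 2 (BDB^T)^{-1} M (BDB^T)^{-1} BD \\
&= 2 [I_k + (BAB^T)^{-1} M (BAB^T)^{-1}] BA - 2 (BDB^T)^{-1} M (BDB^T)^{-1} BD.
\end{aligned}$$

Since

$$\frac{\partial M^T \{ (BAB^T)^{-1} - (BDB^T)^{-1} - I_k \}}{\partial \langle D \rangle_{\gamma\delta}} = M^T (BDB^T)^{-1} B J_{\gamma\delta} B^T (BDB^T)^{-1},$$

then

$$\frac{\partial \phi}{\partial D} = -B^T (BDB^T)^{-1} M (BDB^T)^{-1} B.$$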

In [9] Quireln extremizes the B-average inter class divergence 

Dj - I tr(Q) - afeii k; with 

A T —1 T 

Q - £ (BA B^) ^(BS.B^) 

1«1 

where B is kxn of rank k, A. is symmetric nxn of full rank, and S, is 

^ 1 

symmetric for i=l,...,m. (Actually, A^ is the covariance matrix for the 

m T 

ith class and ^ [A^ + where 6^^ » - y^, the difference 

Ui 

the 1th and Jth class means.) 

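We again present an abbreviated version of the calculation.

Since

$$\frac{\partial Q}{\partial \langle B \rangle_{\gamma\delta}} = \sum_{i=1}^{m} \{ (B A_i B^T)^{-1} (J_{\gamma\delta} S_i B^T + B S_i J_{\gamma\delta}^T) - (B A_i B^T)^{-1} (J_{\gamma\delta} A_i B^T + B A_i J_{\gamma\delta}^T)(B A_i B^T)^{-1} (B S_i B^T) \},$$

then

$$\frac{\partial D_B}{\partial B} = \sum_{i=1}^{m} (B A_i B^T)^{-1} [B S_i - (B S_i B^T)(B A_i B^T)^{-1} B A_i].$$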
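The expression for $\partial D_B / \partial B$ can be compared with central differences of $\frac{1}{2} \mathrm{tr}(Q)$, since any additive constant in $D_B$ does not affect the derivative. The sketch below uses $k = 2$, $n = 3$, $m = 2$; the matrices $A_i$ and $S_i$ are hypothetical symmetric matrices chosen for illustration, not built from actual class statistics.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inv2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

# Hypothetical symmetric matrices (assumptions for illustration).
A_list = [
    [[2.0, 0.3, 0.1], [0.3, 1.5, -0.2], [0.1, -0.2, 1.0]],
    [[1.0, -0.1, 0.4], [-0.1, 2.5, 0.2], [0.4, 0.2, 1.8]],
]
S_list = [
    [[1.2, 0.1, 0.5], [0.1, 3.0, 0.0], [0.5, 0.0, 2.2]],
    [[2.4, 0.6, -0.3], [0.6, 1.1, 0.2], [-0.3, 0.2, 0.9]],
]
B = [[1.0, 0.2, -0.5],
     [0.3, 1.1, 0.4]]

def half_trace_q(B):
    # (1/2) tr(Q), Q = sum_i (B A_i B^T)^{-1} (B S_i B^T)
    total = 0.0
    for A_i, S_i in zip(A_list, S_list):
        P_inv = inv2(matmul(matmul(B, A_i), transpose(B)))
        R = matmul(matmul(B, S_i), transpose(B))
        Q_i = matmul(P_inv, R)
        total += Q_i[0][0] + Q_i[1][1]
    return 0.5 * total

def gradient(B):
    # sum_i (B A_i B^T)^{-1} [B S_i - (B S_i B^T)(B A_i B^T)^{-1} B A_i]
    G = [[0.0] * 3 for _ in range(2)]
    for A_i, S_i in zip(A_list, S_list):
        P_inv = inv2(matmul(matmul(B, A_i), transpose(B)))
        R = matmul(matmul(B, S_i), transpose(B))
        BS = matmul(B, S_i)
        correction = matmul(matmul(R, P_inv), matmul(B, A_i))
        inner = [[BS[i][j] - correction[i][j] for j in range(3)] for i in range(2)]
        term = matmul(P_inv, inner)
        for i in range(2):
            for j in range(3):
                G[i][j] += term[i][j]
    return G

G = gradient(B)
eps = 1e-6
max_err = 0.0
for g in range(2):
    for d in range(3):
        plus = [row[:] for row in B]
        minus = [row[:] for row in B]
        plus[g][d] += eps
        minus[g][d] -= eps
        numeric = (half_trace_q(plus) - half_trace_q(minus)) / (2.0 * eps)
        max_err = max(max_err, abs(numeric - G[g][d]))
```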


The next application will be in the problem of extremizing the B-average
interclass divergence in a reduced feature space with respect to the generator
of a single Householder transformation, B, which compresses n-feature data
to k features. In [2] Decell and Mayekar addressed this problem in
modified form and Decell and others (see [1] and [3]) have made significant
progress in the area of feature selection. The result below suggests another
possible approach to the feature selection problem.

Let $D_B$ be as in the previous example and

$$B = (I_k \,|\, Z) \Big( I - 2 \frac{U U^T}{U^T U} \Big),$$

where $(I_k | Z)$ is $k \times n$ with $I_k$, the $k \times k$ identity, in the first $k \times k$ block
and zeroes elsewhere, and U is a non-zero $n \times 1$ vector.

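First,

$$\frac{\partial}{\partial \langle U \rangle_j} \Big( \frac{U U^T}{U^T U} \Big) = \frac{(J_j U^T + U J_j^T)(U^T U) - U U^T (J_j^T U + U^T J_j)}{(U^T U)^2}.$$

Using the fact that $U^T U$, $J_j^T U$, and $U^T J_j$ are scalars, we can write

$$\frac{\partial}{\partial \langle U \rangle_j} \Big( \frac{U U^T}{U^T U} \Big) = \frac{1}{(U^T U)^2} \{ [U J_j^T U^T U + U^T U J_j U^T] - [U J_j^T U U^T + U U^T J_j U^T] \}.$$

Thus by Theorem 4.1,

$$\frac{\partial D_B}{\partial U} = -\frac{2}{U^T U} \Big( I - \frac{U U^T}{U^T U} \Big) \Big[ \Big( \frac{\partial D_B}{\partial B} \Big)^T (I_k | Z) + (I_k | Z)^T \Big( \frac{\partial D_B}{\partial B} \Big) \Big] U.$$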

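Thus at extrema we have

$$-\frac{2}{U^T U} \Big( I - \frac{U U^T}{U^T U} \Big) \Big[ \Big( \frac{\partial D_B}{\partial B} \Big)^T (I_k | Z) + (I_k | Z)^T \Big( \frac{\partial D_B}{\partial B} \Big) \Big] U = 0.$$

Thus, to extremize $D_B$, it is necessary to solve the equation

$$\Big( I - \frac{U U^T}{U^T U} \Big) A(U) U = 0, \quad \text{where} \quad A(U) = \Big[ \Big( \frac{\partial D_B}{\partial B} \Big)^T (I_k | Z) + (I_k | Z)^T \Big( \frac{\partial D_B}{\partial B} \Big) \Big],$$

the $n \times n$ matrix of rank k. The above equality is equivalent to

$$A(U) U = \lambda U \quad \text{for some } \lambda \in R,$$

since $U U^T / (U^T U)$ projects in the U direction.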
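The chain of steps leading to $\partial D_B / \partial U$ applies to any differentiable scalar function of B. As a sanity check, the sketch below replaces $D_B$ by the simpler hypothetical objective $u(B) = \mathrm{tr}(B W B^T)$ with W symmetric, for which $\partial u / \partial B = 2BW$, and compares the resulting formula with central differences in U. The objective, the matrix W, and the vector U are assumptions made only for this illustration.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

n, k = 3, 2
E = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]                      # (I_k | Z)
# Hypothetical symmetric matrix and generator (assumptions for illustration).
W = [[2.0, 0.5, 0.0],
     [0.5, 1.0, -0.3],
     [0.0, -0.3, 3.0]]
U = [1.0, -2.0, 0.5]

def householder_B(U):
    # B = (I_k | Z)(I - 2 U U^T / U^T U)
    uu = sum(u * u for u in U)
    H = [[(1.0 if i == j else 0.0) - 2.0 * U[i] * U[j] / uu for j in range(n)]
         for i in range(n)]
    return matmul(E, H)

def objective(U):
    # Stand-in for D_B: u(B) = tr(B W B^T).
    B = householder_B(U)
    M = matmul(matmul(B, W), transpose(B))
    return M[0][0] + M[1][1]

def A_of_U(U):
    # A(U) = (du/dB)^T (I_k|Z) + (I_k|Z)^T (du/dB), with du/dB = 2 B W.
    B = householder_B(U)
    G = [[2.0 * v for v in row] for row in matmul(B, W)]
    GtE = matmul(transpose(G), E)
    EtG = matmul(transpose(E), G)
    return [[GtE[i][j] + EtG[i][j] for j in range(n)] for i in range(n)]

def gradient(U):
    # -(2 / U^T U)(I - U U^T / U^T U) A(U) U
    uu = sum(u * u for u in U)
    AU = [sum(row[j] * U[j] for j in range(n)) for row in A_of_U(U)]
    proj = sum(U[i] * AU[i] for i in range(n)) / uu
    return [-2.0 / uu * (AU[i] - proj * U[i]) for i in range(n)]

grad = gradient(U)
eps = 1e-6
max_err = 0.0
for j in range(n):
    plus, minus = U[:], U[:]
    plus[j] += eps
    minus[j] -= eps
    numeric = (objective(plus) - objective(minus)) / (2.0 * eps)
    max_err = max(max_err, abs(numeric - grad[j]))

# The objective is invariant under rescaling of U, so the gradient is orthogonal to U.
radial = sum(g * u for g, u in zip(grad, U))
```

The vanishing radial component reflects the projection factor $I - U U^T / (U^T U)$, which is what turns the stationarity condition into the eigenvalue form $A(U) U = \lambda U$.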

The eigenvalue problem $A(U) U = \lambda U$ suggests possible iteration schemes,
but a scheme with good convergence properties is yet to be determined and the
question remains open.